You can tell that a VCF file contains phased genotypes when the delimiter used in the GT field is a pipe symbol (|), for example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
10 60523 rs148087467 T G 100 PASS AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28
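The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any 1000 Genomes tool: the function name `is_phased` is hypothetical, and the line is split on any whitespace so that it works both on the space-separated example above and on real, tab-delimited VCF files. Haploid calls have no delimiter at all, so they are reported as not phased.

```python
def is_phased(vcf_line, sample_index=0):
    """Return True if the chosen sample's GT value uses the '|' delimiter."""
    fields = vcf_line.split()                 # tabs in real files; spaces work too
    format_keys = fields[8].split(":")        # FORMAT column, e.g. GT:GL
    sample_values = fields[9 + sample_index].split(":")
    genotype = sample_values[format_keys.index("GT")]
    return "|" in genotype


line = ("10 60523 rs148087467 T G 100 PASS "
        "AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; "
        "GT:GL 0|0:-0.19,-0.46,-2.28")
print(is_phased(line))  # True
```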
The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.
The majority of the VCF files in official releases over the lifetime of the project contained phased variants. This is also true for the pilot, phase 1 and final phase 3 data sets.
The phase 1 release files contain global R2 values but you can also use the VCF to plink converter if you wish to use our files with haploview or another similar tool.
No. While bi-allelic calling was used in earlier phases of the 1000 Genomes Project, multi-allelic SNPs, indels, and a diverse set of structural variants (SVs) were called in the final phase 3 call set. More information can be found in the main phase 3 publication from the 1000 Genomes Project and the structural variation publication. The supplementary information for both papers provides further detail.
In earlier phases of the 1000 Genomes Project, the programs used for genotyping were unable to genotype sites with more than two alleles. In most cases, the highest frequency alternative allele was chosen and genotyped. Depth of coverage, base quality and mapping quality were also used when making this decision. This was the approach used in phase 1 of the 1000 Genomes Project. As methods were developed during the 1000 Genomes Project, it is recommended to use the final phase 3 data in preference to earlier call sets.
There are a number of tools available in the Tools page of the 1000 Genomes Browser.
Our data is in standard formats like SAM and VCF, which have tools associated with them. To manipulate SAM/BAM files look at SAMtools for a C based toolkit and links to APIs in other languages. To interact with VCF files look at VCFtools which is a set of Perl and C++ code.
We also provide a public MySQL instance with copies of the databases behind the 1000 Genomes Ensembl browsers. These databases are described on our public instance page.
We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.
An example Perl command to run the script would be:
perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz \
-sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel \
-region 13:32889611-32973805 -population GBR -population FIN
You can sub-sample a VCF file for a particular individual or list of individuals using either the Data Slicer or a combination of tabix and VCFtools.
The Data Slicer, described in more detail in the documentation, can filter by individual or by population. The individual filter takes the individual names from the VCF header and presents them as a list before giving you the final file. To filter by population, you must also provide a panel file which pairs individuals with populations; again, you are presented with a list to select from before being given the final file. Multiple elements can be selected in both lists.
To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
The final data set produced by the 1000 Genomes Project was the phase 3 integrated data set. This contains fully phased haplotypes for 2,504 individuals. Full details can be found in the 1000 Genomes project phase 3 publication.
The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.
Please look at the software’s website to find those files.
Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi-allelic variants, each alternative allele frequency is presented in a comma-separated list.
An example info column which contains this information looks like:
1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
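As a rough sketch of how such an INFO column can be read programmatically, the Python below splits it into key/value pairs and returns the frequencies for a given key as a list, with one value per alternative allele. The helper names `parse_info` and `allele_frequencies` are illustrative only.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries without '=' map to True."""
    parsed = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a trailing ';'
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed


def allele_frequencies(info, key="AF"):
    """Return one float per alternative allele for a key such as AF or EUR_AF."""
    return [float(value) for value in parse_info(info)[key].split(",")]


info = ("AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;"
        "AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP")
print(allele_frequencies(info, "EUR_AF"))  # [0.7316]
```

For a multi-allelic site the same call returns several values, in the order of the alleles listed in the ALT column.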
If you want population specific allele frequencies you have three options:
* For a single variant you can look at the population genetics page for a variant in our browser. This gives you pie charts and a table for a single site.
* For a genomic region you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
* If you would like sub-population allele frequencies for a whole file, you are best to use the vcftools command line tool.
This is done using a combination of two vcftools commands called vcf-subset and fill-an-ac.
An example command set using files from our phase 1 release would look like:
grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list
vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac |
bgzip -c > CEU.chr13.phase1.vcf.gz
Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).
Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.
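The AC/AN calculation can be sketched as follows. This is an illustrative Python helper (the name `allele_frequency` is hypothetical), and it assumes the INFO column carries both AC and AN, as it does after running fill-an-ac. For multi-allelic sites AC holds one count per alternative allele, so a list is returned.

```python
def allele_frequency(info):
    """Compute AC/AN for each alternative allele from a VCF INFO string."""
    values = dict(entry.split("=", 1) for entry in info.split(";") if "=" in entry)
    allele_number = int(values["AN"])
    return [int(count) / allele_number for count in values["AC"].split(",")]


info = "AC=3050;AF=0.609026;AN=5008;NS=2504"
print(round(allele_frequency(info)[0], 6))  # 0.609026
```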
There are two ways to get a subset of a VCF file.
The first is to use the Data Slicer tool from our browser, which is documented here. This tool gives you a web interface requesting the URL of any VCF file and the genomic location you wish to get a sub-slice for. It also works for BAM files, and it can filter the file for particular individuals or populations if you also provide a panel file.
The second method is to use tabix on the command line, for example:
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
Specifications for the VCF format, and a C++ and Perl tool set for VCF files, can be found at vcftools on sourceforge.
Please note that all our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than chr1 in the UCSC style. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
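If your regions come from a UCSC-style source, one option is to strip the prefix before building the tabix command. The Python below is a small sketch of that idea; the function name `to_ensembl_region` is hypothetical, and it only removes a leading "chr", so it does not handle other naming differences such as chrM versus MT.

```python
def to_ensembl_region(region):
    """Convert a UCSC-style region (chr2:100-200) to the Ensembl style (2:100-200)."""
    chrom, sep, coords = region.partition(":")
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + coords


print(to_ensembl_region("chr2:39967768-39967768"))  # 2:39967768-39967768
```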
The Data Slicer is a web based tool in our browser which allows you to get subsections of our indexed VCF and BAM files.
The Phase 1 integrated variant set does not report the depth of coverage for each individual at each site. We instead report genotype likelihoods and dosage. If you would like to see depth of coverage numbers you will need to calculate them directly.
The bedtools suite provides a method to do this.
genomeCoverageBed can produce a bed file which specifies the coverage for every base in the genome, and intersectBed provides the intersection between two vcf/bed/bam files.
These commands also require samtools, tabix and vcftools to be installed.
An example set of commands would be:
samtools view -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01375/alignment/HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam 2:1,000,000-2,000,000 | genomeCoverageBed -ibam stdin -bg > coverage.bg
This command gives you a bedgraph file of the coverage of the HG01375 bam between 2:1,000,000-2,000,000
tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr2.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:1,000,000-2,000,000 | vcf-subset -c HG01375 | bgzip -c > HG01375.vcf.gz
This command gives you the vcf file for 2:1,000,000-2,000,000 with just the genotypes for HG01375
To get the coverage for all those sites you would use
intersectBed -a HG01375.vcf.gz -b coverage.bg -wb > depth_numbers.vcf
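To make the coordinate handling behind this intersection explicit, here is a simplified Python sketch of the same lookup. It is not how intersectBed is implemented: the function name `depth_at_sites` is hypothetical, it scans the intervals linearly (adequate only for small regions), and it uses the POS column alone rather than the full span of the reference allele. The key point is that bedGraph intervals are 0-based and half-open while VCF positions are 1-based.

```python
def depth_at_sites(bedgraph_lines, vcf_lines):
    """Yield (chrom, pos, depth) for each VCF record; depth is 0 if uncovered."""
    intervals = []
    for line in bedgraph_lines:
        chrom, start, end, depth = line.split()[:4]
        intervals.append((chrom, int(start), int(end), int(float(depth))))
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split()[:2]
        pos = int(pos)
        depth = 0
        for b_chrom, start, end, value in intervals:
            # bedGraph is 0-based half-open; VCF POS is 1-based
            if b_chrom == chrom and start < pos <= end:
                depth = value
                break
        yield chrom, pos, depth


bedgraph = ["2\t999999\t1000010\t4", "2\t1000010\t1000020\t7"]
vcf = ["#CHROM\tPOS\tID", "2\t1000000\t.", "2\t1000011\t.", "2\t1000050\t."]
for site in depth_at_sites(bedgraph, vcf):
    print(site)
```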
For more information about bed file formats, please see the Ensembl File Formats Help.
For more information you may wish to look at our documentation about data slicing
Our variant files are distributed in vcf format, a format initially designed for the 1000 Genomes Project which has seen wider community adoption.
The majority of our vcf files are named in the form:
**<span style="color:red">ALL</span>.<span style="color:blue">chrN</span> | <span style="color:green">wgs | wex</span>.<span style="color:orange">2of4intersection</span>.<span style="color:violet">20100804</span>.<span style="color:darkblue">snps | indels | sv</span>.genotypes.<span style="color:darkred">analysis_group</span>.vcf.gz**.
The name starts with the <span style="color:red">population</span> that the variants were discovered in; ALL means that all the individuals available at that date were used. Next comes the <span style="color:blue">region</span> covered by the call set, which can be a chromosome, <span style="color:green">wgs</span> (the file contains at least all the autosomes) or <span style="color:green">wex</span> (the whole exome), followed by a <span style="color:orange">description</span> of how the call set was produced or who produced it. The <span style="color:violet">date</span> matches the sequence and alignment freezes used to generate the variant call set. Then comes a field describing the <span style="color:darkblue">type of variant</span> the file contains, and the <span style="color:darkred">analysis group</span> used to generate the variant calls, which should be low coverage, exome or integrated. Finally, the name indicates either sites or genotypes: a sites file contains just the first 8 columns of the VCF format, while a genotypes file contains individual genotype data as well.
Release directories should also contain panel files, which describe which individuals the variants have genotypes for and which populations those individuals are from.
LDAF is an allele frequency value in the info column of our phase 1 VCF files.
Our standard AF values are allele frequencies rounded to 2 decimal places calculated using allele count (AC) and allele number (AN) values. LDAF is the allele frequency as inferred from the haplotype estimation.
You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for this site. This is particularly true close to the ends of the chromosomes.
All of the 1000 Genomes SNPs and indels have been submitted to dbSNP, and will have rsIDs in the main 1000 Genomes release files. The SVs have all been submitted to DGVa and have esvIDs in the main files.
If you are using some of the older working files that were used during the data gathering phase of the 1000 Genomes Project, you may find some variants with other kinds of identifiers, such as Alu_umary_Alu_###. These identifiers were created internally by the groups that did that set of particular variant calling, and are not found anywhere other than these files, as they will have been replaced by official IDs in the later files.
All our variant call releases since 20100804 have come with a panel file. This file lists all the individuals who are part of the release and the population they come from.
This is a tab delimited file which must have sample and population in its first two columns; some files may then have subsequent columns which describe additional information like which super population a sample comes from or what sequencing platforms have been used to generate sequence data for that sample.
The panel files have names like integrated_call_samples.20101123.ALL.panel or integrated_call_samples_v2.20130502.ALL.panel.
These panel files can be used by our browser tools (the Data Slicer, Variant Pattern Finder and VCF to PED converter) to establish population groups for filtering.
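If you want to build population groups from a panel file yourself, a minimal Python sketch looks like the following. It relies only on the layout described above (tab-delimited, sample in the first column, population in the second). The function name `samples_by_population` is hypothetical, the header check assumes that any header row starts with the word "sample", and the rows shown are made up for illustration rather than copied from a release.

```python
from collections import defaultdict


def samples_by_population(panel_lines):
    """Group sample names by the population given in the second column."""
    groups = defaultdict(list)
    for line in panel_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2 or columns[0] == "sample":
            continue  # skip blank lines and a header row, if present
        groups[columns[1]].append(columns[0])
    return dict(groups)


# Illustrative rows only; read real ones with open(panel_path) instead.
panel = [
    "sample\tpop\tsuper_pop",
    "SAMPLE_A\tGBR\tEUR",
    "SAMPLE_B\tFIN\tEUR",
    "SAMPLE_C\tGBR\tEUR",
]
print(samples_by_population(panel)["GBR"])  # ['SAMPLE_A', 'SAMPLE_C']
```

The list for one population can then be written to a file, one name per line, and passed to vcf-subset with the -c option.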
All the variants in both our VCF files and on the browser are always reported on the forward strand.
The project has two releases of structural variation. The pilot paper data directory contains VCF files for deletions, mobile element insertions, tandem duplications and novel sequence for both the low coverage and trio pilot studies. Our phase 1 integrated release contains deletions together with the SNPs and short indels.
The VCF files on our site cover a wide variety of different versions, but our most recent release VCF files are in format version 4.1.
In some early main project releases the allele frequency (AF) was estimated using additional information like LD, mapping quality and Haplotype information. This means in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release AF should always match AC/AN rounded to 2 decimal places.
The phase 3 VCF files released in June 2014 contain overlapping and duplicate sites.
This is due to an error in the processing pipeline used when sets of variant calls were combined. Originally, all multi-allelic sites were separated into individual lines in the VCF file during the pipeline, but the recombination process did not always succeed, leaving us with a small number of sites with overlapping or duplicate call records. This is most commonly seen in chromosome X.
The simplest solution to this is to ignore duplicate sites in any analysis. If you wish to use one or both of a pair of duplicate sites in your own analysis, you should use the GRCh37 alignment files to recall the genotypes of interest in the individuals you are interested in to resolve the conflict.
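One way to find candidate sites to ignore is to look for records that share the same chromosome and position. The Python below is a sketch under that assumption: the function name `duplicate_positions` is hypothetical, and it only detects records with identical CHROM and POS values, not records that overlap because one of them spans several bases.

```python
from collections import Counter


def duplicate_positions(vcf_lines):
    """Return a sorted list of (chrom, pos) pairs that occur on more than one record."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split()[:2]
        counts[(chrom, int(pos))] += 1
    return sorted(site for site, n in counts.items() if n > 1)


# Made-up records for illustration.
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "X\t100\t.\tA\tG",
    "X\t100\t.\tA\tT",
    "X\t250\t.\tC\tT",
]
print(duplicate_positions(records))  # [('X', 100)]
```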
Our August 2010 call set represents a merge of several independent call sets. Not all the call sets in the merge had genotypes associated with them, and the merge was carried out using a predefined set of rules. As a result, some individuals or whole variant sites have no genotype, which is described as ./. in VCF 4.0. In our November 2010 call set and all subsequent call sets, all sites have genotypes for all individuals for chr1-22 and X.
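If you need to know how many individuals lack a genotype at a site in one of the older files, the check can be sketched as below. This Python helper is illustrative only (the name `count_missing_genotypes` is hypothetical). It expects a tab-delimited VCF data line with a FORMAT column that includes GT, and treats a genotype as missing when every allele in it is ".".

```python
def count_missing_genotypes(vcf_line):
    """Count samples on one VCF data line whose GT is entirely missing (e.g. ./.)."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")
    missing = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        alleles = genotype.replace("|", "/").split("/")
        if all(allele == "." for allele in alleles):
            missing += 1
    return missing


# Made-up record with three samples, two of them without a genotype.
record = "1\t1000\t.\tA\tG\t100\tPASS\tAN=2\tGT\t0/1\t./.\t./."
print(count_missing_genotypes(record))  # 2
```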